I conduct an STM (Structural Topic Model) estimation on a sample of 14,936 online news articles about domestic politics from seven news providers: Bild.de, DIE WELT, FOCUS ONLINE, SPIEGEL ONLINE, stern.de, ZEIT ONLINE, and Tagesschau.de. The articles date from 01.06.2017 to 01.03.2018 (the German federal election took place on 24 September 2017). I first extract all online articles using the EventRegistry API. Then all articles from the section “domestic policy” are filtered by checking the URL structure. The data cleaning process is documented here.

To discover the latent topics in the corpus, I apply the structural topic model (STM) developed by Roberts (2016). The STM is an unsupervised machine learning approach that models topics as multinomial distributions of words and documents as multinomial distributions of topics, and allows external covariates that affect both topical content and topical prevalence to be incorporated. I include the news provider as a covariate for both topical content and topical prevalence. The number of topics is set to 50.
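The estimation call itself is not shown in this document; given the description above (K = 50, news provider as covariate for both prevalence and content), the stm call has roughly the following form. This is a sketch, not the exact call used: the object `out` (the prepared documents, vocabulary, and metadata, as referenced in the estimateEffect step later) is assumed.

```r
library(stm)

# Sketch of the estimation call (assumption: `out` holds the prepared
# documents/vocab/meta from the preprocessing step)
stmOut <- stm(documents = out$documents, vocab = out$vocab,
              K = 50,
              prevalence = ~ site,   # news provider affects topic prevalence
              content    = ~ site,   # ... and topical content
              data       = out$meta)
```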

Exploratory Data Analysis

Distribution of articles

The figures below show the distribution of the number of articles from the respective news sources by date. There is a pronounced peak around the federal election on September 24th.

ggsave(plot = {
  btw %>%
    ggplot(aes(site)) +
    geom_bar(fill = col[8], alpha = 0.8) +
    labs(x = "", y = "Number of articles") +
    theme(
      axis.text = element_text(size = 6),
      legend.position = "none"
    )
},
filename = "../figs/bar.png", device = "png",
width = 6, height = 4, dpi = 600)
Number of articles by site

ggsave(plot = {
  btw %>%
    group_by(date) %>%
    dplyr::summarise(obs = n()) %>%
    ggplot(aes(date, obs)) +
    geom_line(color = col[3]) +
    geom_vline(aes(xintercept = as.Date("2017-09-24")),
               linetype = 2, color = col[5]) +
    scale_color_manual(values = col) +
    labs(x = "", y = "Number of articles", color = "") +
    scale_x_date(breaks = date_breaks("1 month"),
                 labels = date_format("%B", tz = "CET")) +
    theme(
      legend.position = "none",
      axis.title.x = element_blank(),
      axis.text = element_text(size = 8)
    )
},
filename = "../figs/timeline.png", device = "png",
width = 6, height = 4, dpi = 600)
Number of articles by date

Word Count

output <- describeBy(btw$text_length, btw$site, mat = TRUE) %>%
  select(group1, n, mean, sd, median, min, max, se)
rownames(output) <- c()

print(xtable(output), type = "latex", file = "../writing/tables/wordcount.tex")

describeBy(btw$text_length, btw$site)
## 
##  Descriptive statistics by group 
## group: Bild.de
##    vars    n  mean     sd median trimmed    mad min  max range skew
## X1    1 1277 475.2 319.22    394   431.5 253.52 121 3710  3589 2.48
##    kurtosis   se
## X1    13.77 8.93
## -------------------------------------------------------- 
## group: DIE WELT
##    vars    n   mean     sd median trimmed    mad min   max range  skew
## X1    1 3179 507.88 614.28    377  432.34 257.97 121 14507 14386 13.11
##    kurtosis    se
## X1   248.56 10.89
## -------------------------------------------------------- 
## group: FOCUS ONLINE
##    vars    n   mean     sd median trimmed    mad min  max range skew
## X1    1 2660 402.68 330.86    299  345.64 185.32 121 5647  5526 4.34
##    kurtosis   se
## X1    40.74 6.42
## -------------------------------------------------------- 
## group: SPIEGEL ONLINE
##    vars    n   mean     sd median trimmed    mad min  max range skew
## X1    1 1817 498.96 333.23    387   456.2 253.52 121 3304  3183    2
##    kurtosis   se
## X1     7.95 7.82
## -------------------------------------------------------- 
## group: stern.de
##    vars    n   mean     sd median trimmed    mad min  max range skew
## X1    1 2922 518.09 622.99  376.5  409.87 223.13 121 9287  9166 6.51
##    kurtosis    se
## X1    59.72 11.53
## -------------------------------------------------------- 
## group: Tagesschau.de
##    vars    n   mean     sd median trimmed    mad min  max range skew
## X1    1 1644 450.34 242.93  397.5  422.13 215.72 121 2006  1885 1.41
##    kurtosis   se
## X1     3.45 5.99
## -------------------------------------------------------- 
## group: ZEIT ONLINE
##    vars    n   mean     sd median trimmed    mad min  max range skew
## X1    1 1437 510.98 377.85    459  470.17 217.94 121 8015  7894 9.38
##    kurtosis   se
## X1   149.73 9.97
# Bin word counts at every integer up to 2,000; all longer texts share the last bin
btw$text_length.cut <- cut(btw$text_length, c(seq(0, 2000, 1), Inf))
btw <- btw %>% group_by(site) %>%
  mutate(text_length.mean = mean(text_length)) %>%
  ungroup()
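The cut() call above bins word counts at every integer up to 2,000 and pools all longer texts into one open-ended bin, so that as.numeric() on the resulting factor effectively caps the word count at 2,001. A small self-contained illustration with made-up word counts:

```r
x <- c(150, 1999, 2500)                  # hypothetical word counts
bins <- cut(x, c(seq(0, 2000, 1), Inf))  # same breaks as above
as.numeric(bins)                         # 150 1999 2001: long texts share the last bin
```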

ggsave(plot = {
  btw %>%
    ggplot(aes(as.numeric(text_length.cut),
                       group = site)) +
    geom_histogram(aes(y=..density..), alpha=.8, fill = col[3]) + 
    scale_x_continuous(labels = c(seq(0,1500,500), ">2000")) +
    facet_wrap(~site, ncol = 4) +
    labs(x = "Word count", color = "") +
    theme(axis.title = element_text(size = 8),
          axis.text = element_text(size = 6))
  
}, filename = "../figs/wordcount.png", width = 8, height = 4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Word Count

Wordclouds

To summarise the content of the texts, wordclouds help to get a first impression, as they visualize the frequencies of words in a corpus. Intuitively, the term frequency (tf) of a word is a measure of how important that word may be. The following word cloud is derived from the whole corpus. As can be seen, problems arise with high-frequency words: for example, “die”, “der”, and “in” are extremely common but unrelated to the quantity of interest. These terms, often called “stop words”, are important to the grammatical structure of a text but typically don’t add meaning and can therefore be neglected. We use a pre-defined stop word list from the Snowball stemmer project together with a customized list of stop words that are redundant, superfluous, or distorting. We also remove punctuation characters (e.g. “.”, “,”, “!”, “?”) and all numbers from our corpus.
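The stop-word removal step can be illustrated in base R with a tiny, hand-picked subset of the Snowball German stop-word list (the real list is far longer):

```r
tokens <- c("die", "koalition", "ist", "in", "der", "krise")
stopwords_de <- c("die", "der", "ist", "in")   # tiny illustrative subset
tokens[!tokens %in% stopwords_de]              # "koalition" "krise"
```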

Unprocessed Text

all.corpus <- corpus(btw$text)
df.corpus <- dfm(all.corpus, removePunct = TRUE, removeSymbols = TRUE)
png("../figs/wordcloud.png")
textplot_wordcloud(df.corpus, max.words = 300)
dev.off()
dev.off()
## quartz_off_screen 
##                 2
Wordcloud 1

Processed Text

After completing these steps, we are left with 68,576 unique terms in our vocabulary. The following wordclouds are derived from the corpus of each news provider. It becomes evident that these are texts discussing domestic policy issues. The SPD in particular appears highly frequently. However, at first glance, there are no obvious differences between the corpora of the different news providers.

Model Results

1. Topic

1.1. Label topics

In order to improve readability and traceability, I assign a shorter name to each topic based on its most common words. The plotQuote function makes it possible to inspect the most common words of a topic for each covariate level.

topic <- 2

labels <- sapply(seq_along(sagelabs$covnames), function(i) {
  paste(sagelabs$covnames[i], ":",
        paste(sagelabs$cov.betas[[i]]$problabels[topic, ], collapse = "\n"))
})

plotQuote(labels, text.cex = 0.8, width = 40)
topics <- matrix(c(
  1, "SPD, M.Schulz",
  2, "B90/ Die Grünen",
  3, "Mix: Akhanli, Guttenberg, Bayern",
  4, "Grand coalition debates",
  5, "Diesel scandal",
  6, "H.Kohl",
  7, "Federal Election results",
  8, "Europa, Macron, Schäuble",
  9, "Mix: political trends, twitter",
  10, "M.Schulz vs. A.Merkel",
  11, "politics & democracy in GER",
  12, "Deportation of criminal Refugees",
  13, "A.Merkel, election campaign",
  14, "Text processing fail",
  15, "Israel, antisemitism, D.Trump",
  16, "Mix: political talkshows, Refugees",
  17, "Debates within SPD",
  18, "Election in Niedersachsen",
  19, "N.Lammert",
  20, "AfD, right-wing radicalism",
  21, "German armed forces, v.d.Leyen",
  22, "SPD, staffing debates",
  23, "CSU, Söder & Seehofer",
  24, "Bundespräsident F.-W.Steinmeier",
  25, "Election polls",
  26, "Jamaica fail, Reelections or GroKo?",
  27, "Jamaica Coalition debates",
  28, "G20 in Hamburg",
  29, "Federal Constitutional Court, Ministry of the Interior",
  30, "AfD, F.Petry & Meuthen",
  31, "D.Trump, Russia",
  32, "AfD, Gauland & Weidel",
  33, "German armed forces, Mali",
  34, "Mix: Children, Education, Women",
  35, "Left-/right-wing Terror, police reports",
  36, "Mix: people, Germany, democracy",
  37, "AfD & DIE LINKE in parliament",
  38, "EU policies",
  39, "Mix: Terror attacks",
  40, "Mix: Metoo, SPD",
  41, "Terror attack Berlin (Amri)",
  42, "Refugee family reunion",
  43, "Mix: studies",
  44, "Federal Constitutional Court (NSU, Franco, Terror)",
  45, "Mix: minister of the interior, environment",
  46, "CDU",
  47, "Church",
  48, "public (budget) statistics, Education/ Healthcare/ Digital policies",
  49, "Turkey",
  50, "Höcke, Holocaust"
), ncol = 2, byrow = TRUE)

topics.df <- as.data.frame(topics) %>%
  transmute(topic_name = paste(V1, V2, sep=": "),
         topic = 1:k) 

1.2. Posterior distribution (gamma)

The theta matrix is a D×K matrix that gives, for each document (D) and each topic (K), the estimated topic proportion.

# Document-topic probabilities
stmOut %>% tidy("theta") -> theta

To get a better understanding of the distribution of the “highest gamma”, we assign a topic to each document (the topic with the highest posterior probability).

top_topics <- theta %>% 
  group_by(document) %>%
  mutate(therank = rank(-gamma)) %>%
  filter(therank == 1) %>%
  select(- therank)
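The same “highest gamma” assignment can be sketched in base R on a toy theta matrix (two documents, three topics; the values are made up):

```r
theta_toy <- matrix(c(0.1, 0.7, 0.2,
                      0.5, 0.3, 0.2), nrow = 2, byrow = TRUE)
max.col(theta_toy)   # 2 1: the topic with the highest proportion per document
```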

btw.2 <- btw %>%
  mutate(document = articleID) %>%
  merge(.,top_topics, by="document") %>%
  ## Combine with Topic label
  merge(., topics.df, by="topic") %>%
  mutate(allocation = 1) 
ggplot(btw.2, aes(gamma)) +
  geom_density(fill = col[3], alpha = 0.8,
               color = col[3]) +
  labs(title = "Density Plot / Posterior distribution",
       x = "Theta")

1.3 Topic proportions

In order to get an initial overview of the results, the figure below displays the topics ordered by their expected frequency across the corpus. To assign a label to each topic, I looked at the most frequent words in that topic and the most representative articles.

keep <- 1:k
Here, I create a data frame that contains the column means of theta (per topic and per covariate level).
frequency <- as.data.frame(colMeans(stmOut$theta)) %>%
  mutate(frequency = colMeans(stmOut$theta),
         topic = topics[,1],
         topic_name=paste(topics[,1],topics[,2], 
                          sep=": ")) %>%
  filter(topic %in% keep)

freq <- tapply(stmOut$theta[,1], stmOut$settings$covariates$betaindex, mean)
freq <- as.data.frame(freq) %>% 
    mutate(site=stmOut$settings$covariates$yvarlevels,
           topic = 1)

for(i in 2:k) {
  freq1 <- tapply(stmOut$theta[,i], stmOut$settings$covariates$betaindex, mean)
  freq1 <- as.data.frame(freq1) %>% 
    transmute(site=stmOut$settings$covariates$yvarlevels,
           topic = i,
           freq = freq1)
  
  freq <- rbind(freq, freq1)
}

freq <- freq %>%
  left_join(., topics.df, by = "topic") %>%
  #filter(topic %in% keep) %>%
  mutate(topic = topic_name) %>%
  left_join(., frequency %>% select(topic, frequency),
            by = "topic")

Next, we can plot the expected proportion of topic use in the overall corpus vs. the expected proportion of topic use for each medium.

p <- ggplot(frequency, aes(x=reorder(topic_name, frequency), y=frequency)) + 
    geom_col(fill=col[3], alpha=0.8) +
    coord_flip() +
    labs(x="", y="expected frequency") +
    theme(axis.text.x = element_text(size=8),
          axis.text.y = element_text(size=9),
          axis.title = element_text(size=10))

ggsave(plot = p, filename = "../figs/topic_proportion.png", height = 8)
Topic Proportion

p <- freq %>%
  mutate(topic =  as.numeric(gsub(":.*$","",topic))) %>% 
  ggplot(aes(reorder(topic_name,topic), freq)) +
  geom_col(fill = col[3]) +
  coord_flip() +
  facet_wrap(~site, ncol = 7) +
  theme(
    #axis.text.y = element_blank(),
          axis.text.y = element_text(size=11),
          axis.title = element_text(size=10)) +
    labs(x="", y="expected frequency") 

ggsave(plot = p, filename = "../figs/topic_proportion2.png")
Topic Proportion

We conduct our further analysis on 15 selected topics that relate to political parties.

keep <- c(1,2,4,10,13,17,20,22,23,26,27,30,32,37,46)

1.4. Difference in topic prevalence

To identify which of these differences are significant, the conditional expectation of topic prevalence given document characteristics can be estimated. More specifically, I estimate a linear model in which the documents are the observations, the dependent variable is the posterior probability of a topic, and the covariates are the document metadata (see the equation below).

\[ \theta_d = \alpha + \beta\, x_{\text{site},d} + \epsilon_d \]

The estimateEffect() function uses the method of composition to incorporate uncertainty in the dependent variable: it repeatedly draws sets of topic proportions from the variational posterior and computes the coefficients as the average over all draws.

effect <- estimateEffect(c(1:k) ~site, stmOut, 
                         metadata = out$meta, uncertainty = "None")

Here, I create a data frame that contains the results of the estimation.

tables <- vector(mode="list", length = length(effect$topics))

for (i in seq_along(effect$topics)) {
  # simulate coefficient draws from the estimated sampling distribution
  sims <- lapply(effect$parameters[[i]], function(x) stm:::rmvnorm(500, x$est, x$vcov))
  sims <- do.call(rbind, sims)
  # point estimates, standard errors, t-values and p-values
  est <- colMeans(sims)
  se <- sqrt(apply(sims, 2, stats::var))
  tval <- est/se
  rdf <- nrow(effect$data) - length(est)
  p <- 2*stats::pt(abs(tval), rdf, lower.tail = FALSE)
  topic <- i
  
  coefficients <- cbind(topic, est, se, tval, p)
  rownames(coefficients) <- attr(effect$parameters[[1]][[1]]$est, "names") 
  colnames(coefficients) <- c("topic", "Estimate", "Std. Error", "t value", "p")
  tables[[i]] <- coefficients
}

out1 <- list(call=effect$call, topics=effect$topics, tables=tables)

coeff <- as.data.frame(do.call(rbind,out1$tables))

coeff <- coeff %>% 
  mutate(parameter = rownames(coeff),
         parameter = gsub("site", "", parameter),
         # flag coefficients significant at the 5% level
         signifcant = ifelse(p <= 0.05, "yes", "no")) %>%
  left_join(., topics.df, by="topic")

The following figure shows the regression results for each news page. The coefficients indicate the deviation from the reference category Bild.de.

p1 <- coeff %>% 
  filter(topic %in% keep) %>%
  filter(parameter %in% stmOut$settings$covariates$yvarlevels) %>%
  ggplot(aes(x = reorder(topic_name,topic, decreasing=F), y = Estimate, fill=factor(signifcant))) +
  geom_col(alpha = 0.8) +
  scale_fill_manual(values = col[c(5,3)]) +
  scale_x_discrete(position = "top") +
  geom_hline(yintercept = 0, linetype = 2,
             color = "black") +
  coord_flip() +
  facet_wrap(~parameter, ncol = 8, scales = "free_x") +
  labs(x="", fill="significant at the 5% level") +
  theme(legend.position = "top", 
        axis.text.y = element_text(size=9),
        axis.text.x = element_text(angle=90))

ggsave(plot = p1, filename = "../figs/estimates.png", device = "png",width = 9, height = 6,
dpi = 600)
Estimates

2. Tone

2.1. Plotquote

The plotQuote function makes it possible to inspect the most common words of a topic for each covariate level.

for (topic in keep) {
  
  png(filename = paste0("../figs/plotquote", topic, ".png"), width = 300,
      height = 700)
  
  labels <- sapply(seq_along(sagelabs$covnames), function(i) {
    paste(sagelabs$covnames[i], ":",
          paste(sagelabs$cov.betas[[i]]$problabels[topic, ], collapse = "\n"))
  })
  
  plotQuote(labels, text.cex = 0.8, width = 40,
            main = paste(topics.df$topic_name[topics.df$topic == topic]))
  
  dev.off()
}

2.2. Sentiment analysis

The idea of sentiment analysis is to determine the attitude of a writer toward a certain topic, or the overall tonality of a document, from text data.

Lexical or “bag-of-words” approaches are commonly used. In this approach, the researcher provides pre-defined dictionaries (lists) of words associated with a given emotion, such as negativity. The target text is then deconstructed into individual words (tokens), and the frequencies of words contained in a given dictionary are calculated.

2.2.1. Load sentiment dictionary.

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative polarity bearing words weighted within the interval of [-1; 1] plus their part of speech tag, and if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 positive and 1,818 negative words, which sum up to 15,649 positive and 15,632 negative word forms incl. their inflections, respectively. It not only contains adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one.

sent <- c(
  # negative words
  readLines("dict/SentiWS_v1.8c_Negative.txt",
            encoding = "UTF-8"),
  # positive words
  readLines("dict/SentiWS_v1.8c_Positive.txt",
            encoding = "UTF-8")
) %>% lapply(function(x) {
  # extract the individual columns
  res <- strsplit(x, "\t", fixed = TRUE)[[1]]
  return(data.frame(words = res[1], value = res[2],
                    stringsAsFactors = FALSE))
}) %>%
  bind_rows %>% 
  mutate(word = gsub("\\|.*", "", words) %>% tolower,
         value = as.numeric(value)) %>%
  # some words occur more than once; take the mean value
  group_by(word) %>% summarise(value = mean(value)) %>% ungroup
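Each SentiWS line has the form word|POS, a tab, the polarity value, a tab, and the inflected forms. The parsing above can be traced on a single line (the entry shown here is illustrative, not quoted from the file):

```r
line   <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"  # illustrative entry
fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
word   <- tolower(gsub("\\|.*", "", fields[1]))   # "abbau"
value  <- as.numeric(fields[2])                   # -0.058
```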

2.2.2. Apply the dictionary to the articles.

To carry out the sentiment analysis, we filter the corpus, keeping:

  1. Articles that have been assigned a topic with a probability of over 30% (gamma > 0.3).
  2. Articles assigned to one of the above-mentioned topics.
  3. Articles that survive some manual cleanups.

After applying these filters, about 2,500 articles remain for the sentiment analysis. We now take each word in each article and assign a sentiment value to that word.

2.2.3. Check the analysis for a set of example documents.

2.2.4. Calculate sentiment value by document

The sentiment score is based on the weighted polarity values of words, defined on the interval [-1, 1]. The score of a document is the sum of the polarity values of all its words that match a dictionary entry, divided by the total number of words in that document.

\[ \text{SentScore}_d = \frac{|\text{positive polarity score}_d| - |\text{negative polarity score}_d|}{\text{Total Words}_d} \]
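As a toy example of the formula (with made-up numbers): suppose a 50-word document matches two dictionary entries with signed polarity values 0.2 and -0.5.

```r
polarity    <- c(0.2, -0.5)                  # signed values of matched words (made up)
total_words <- 50
sent_score  <- sum(polarity) / total_words   # -0.006
```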

sentDF.values <- sentDF %>%
  select(document, word, value, 
         negative, positive,
         negative_d, positive_d) %>%
  group_by(document) %>%
  
  # calculate sum of positive and negative values
  summarise(sum_positive = sum(positive, na.rm = T),
            sum_negative = sum(negative, na.rm = T),
            sum_positive_d = sum(positive_d, na.rm = T),
            sum_negative_d = sum(negative_d, na.rm = T)) %>%

  # calculate diff
  mutate(sent_diff = sum_positive + sum_negative,
         sent_diff_d = sum_positive_d - sum_negative_d) %>%
  
  # combine with dataframe
  left_join(., df, 
            by = "document") %>%
  # calculate sentiment
  mutate(sentiment_d = sent_diff_d / text_length,
         sentiment = sent_diff / text_length) %>%
  
  # generate month & week
  mutate(week = week(date),
         month = month(date),
         year = year(date),
         yearmonth = calculate_month(month,year),
         yearweek = calculate_week(week, year)) 

2.2.5. Plot Sentiment Score

btw.2 %>% 
  mutate(yearmonth = calculate_month(month(date), year(date))) %>%
  group_by(yearmonth) %>%
  summarise(total_obs = n()) -> tally_month

btw.2 %>%
  group_by(site) %>%
  summarise(total_obs = n()) -> tally_site

btw.2 %>% 
  mutate(yearmonth = calculate_month(month(date),year(date))) %>%
  group_by(yearmonth, site) %>%
  summarise(total_obs = n()) -> tally_month_site

The following figure shows the results of the analysis for each topic on a monthly basis, aggregated over all newspapers. Each sentiment value is weighted by the relative share of the topic in the overall reporting of that month.

\[ w = \frac{\text{\# of documents per month \& topic}}{\text{\# of documents per month}} \]

\[ \text{weighted sentiment} = \text{sentiment} \cdot w \]
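As a worked example of the weighting (with made-up numbers): if a topic accounts for 12 of the 60 documents in a month and has a mean sentiment of -0.006, then:

```r
w             <- 12 / 60          # relative share of the topic that month: 0.2
sentiment     <- -0.006           # made-up monthly mean sentiment
weighted_sent <- sentiment * w    # -0.0012
```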

Some conclusions can be drawn from this illustration. First of all, it can be seen that, on average, all topics are discussed almost exclusively negatively. An exception is topic 27 concerning the Jamaica coalition negotiations, which shows a positive sentiment value for a short period (October 2017). In the following month (November 2017), after it became clear that there would be no coalition between the CDU/CSU, FDP and Die Grünen, the value of this topic as well as that of topic 26 drops rapidly.

Concerning the topics that discuss the grand coalition between CDU/CSU and SPD, it is evident that the overall tone in which this topic is discussed generally declines from November 2017 to January 2018, before the sentiment value rises again in February. By contrast, the sentiment score of topics that deal with the SPD alone (1, 17, 22) diminishes over time, with topic 17 recording the largest decline. The topic covering the CDU in isolation (46) zigzags, with a trough in October 2017.

By Month & Topic
Monthly Sentiment Score

By Site and Topic

The following figure shows the results of the analysis for each topic and each newspaper, aggregated over time. Each sentiment value is weighted by the relative share of the topic in the overall political news coverage of that online newspaper.

\[ w = \frac{\text{\# of obs per newspaper \& topic}}{\text{\# of obs per newspaper}} \]

\[ \text{weighted sentiment} = \text{sentiment} \cdot w \]

Sentiment Score by Newssite

Radar plot

require(ggiraph)
require(ggiraphExtra)
Radarplot Sentiment Score

3. Compare with polls

We use the data from the “Sonntagsumfrage” (Sunday survey) by infratest dimap. The institute regularly asks at least 1,000 German citizens: “Which party would you vote for if the federal election took place next Sunday?” The survey thus measures current voting tendencies and reflects an intermediate state in the opinion-forming process of the electorate.

# Import and prepare survey data
load("../output/polls.Rda")

polls <- table_long %>%
  mutate(yearmonth = calculate_month(month(Datum),year(Datum)),
         yearweek = calculate_week(week(Datum), year(Datum))) 

Poll Results

Normalization

The comparison with the sentiment value of individual topics is intended to show whether there is a correlation between current voting tendencies and the type of reporting. For this purpose, the monthly average of both time series is calculated and rescaled to the interval [0, 1].

The analysis in this section is done in two steps: First, the standardised time series are compared graphically. In the second step, the cross-correlation between two time series is calculated.

\[ z_i = \frac{x_i-\text{min}(x)}{\text{max}(x)-\text{min}(x)} \] where \(x = (x_1,...,x_n)\) and \(z_i\) is the \(i^{th}\) normalized value.
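The helper normalize_data is used below but not defined in this document; a minimal implementation consistent with the min-max formula above would be:

```r
# minimal sketch of the (assumed) normalize_data helper, following the formula above
normalize_data <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
normalize_data(c(2, 4, 6))   # 0.0 0.5 1.0
```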

polls_monthly <- polls %>%
  group_by(yearmonth,party) %>%
  summarise(mean_val = mean(value, na.rm=T)) %>%
  ungroup() %>%
  filter(!party %in% c("Piraten", "Sonstige")) %>%
  filter(mean_val != "NaN") %>%
  select(party, yearmonth, mean_val) %>%
  spread(party, mean_val) %>%
  filter(yearmonth > as.Date("2017-05-01")) %>%
  filter(yearmonth < as.Date("2018-03-01")) %>%
  mutate(AfD = normalize_data(AfD),
         FDP = normalize_data(FDP),
         Grüne = normalize_data(Grüne),
         Linke = normalize_data(Linke),
         SPD = normalize_data(SPD),
         Union = normalize_data(Union))

polls_monthly %>%
  gather(key = party, value = poll, AfD:Union) %>%
  ggplot(aes(yearmonth, poll, color = party)) +
  geom_line() +
  scale_color_manual(values =  c("deepskyblue", "gold", "limegreen", "deeppink", "red", "black")) +
  labs(x="", y="", title = "Survey Value", 
       subtitle = "monthly", color = "") +
  geom_vline(xintercept = as.Date("2017-09-24"), 
             linetype = 2, color = col[3]) +
  scale_x_date(breaks = date_breaks("1 month"), labels=date_format("%b", tz="CET")) +
    theme(axis.text.x = element_text(size = 10))

Normalize topic-sentiment value

### by time
group1 <- sentDF.values %>%
  group_by(yearmonth, topic_name, topic) %>%
  summarise(sentiment = mean(sentiment, na.rm=T),
            obs = n()) %>%
  ungroup() %>%
  left_join(., tally_month, by="yearmonth") %>%
  mutate(w = obs / total_obs,
         sentiment_w = sentiment * w) 
  
group1 <- group1 %>%
  select(yearmonth, topic, sentiment_w) %>%
  spread(topic, sentiment_w) %>%
  lapply(normalize_data) %>%
  as.data.frame() %>%
  gather(topic, sentiment_n, -yearmonth) %>%
  mutate(topic = as.numeric(gsub("X","", topic))) %>%
  merge(., group1, by = c("topic", "yearmonth"))

### by time and site
group2 <- sentDF.values %>%
  group_by(yearmonth, topic_name, topic, site) %>%
  summarise(sentiment = mean(sentiment, na.rm=T),
            obs = n()) %>%
  ungroup() %>%
  left_join(., tally_month_site, by=c("yearmonth","site")) %>%
  mutate(w = obs / total_obs,
         sentiment_w = sentiment * w)

group2 <- group2 %>%
  select(yearmonth, topic, site, sentiment_w) %>%
  spread(topic, sentiment_w) %>%
  lapply(normalize_data) %>%
  as.data.frame() %>%
  gather(topic, sentiment_n, -yearmonth, -site) %>%
  mutate(topic = as.numeric(gsub("X","", topic))) %>%
  merge(., group2, by = c("topic", "yearmonth","site"))

3.1. Cross-Correlation

plot_ccf <- function(corr1, corr2, ...){
  
  # create an empty grid of all variable pairs
  corr <- expand.grid(names(corr1), names(corr2))
  corr$correlation <- NA
  
  x <- 1
  
  for (corr1Id in 1:ncol(corr2)) {
    c1 <- corr2[, corr1Id]
    
    for (corr2Id in 1:ncol(corr1)) {
      c2 <- corr1[, corr2Id]
      correlation <- ccf(c1, c2, lag.max = 0,
                         na.action = na.contiguous, plot = FALSE)
      corr[x, 3] <- correlation$acf[1]
      
      x <- x + 1
    }
  }
  
  ## plot the correlation matrix
  ggplot(corr, aes(Var2, Var1,
                   fill = correlation,
                   label = round(correlation, 2))) +
    geom_tile(color = "white", ...) +
    geom_text(size = 3.5, color = col[5]) +
    viridis::scale_fill_viridis(limits = c(-1, 1)) +
    labs(x = "", y = "", fill = "",
         title = "",
         ...)
}
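With lag.max = 0, ccf() reduces to the ordinary Pearson correlation of the two series (both use the same demeaning and normalization, and the 1/n factors cancel), which can be checked on simulated data:

```r
set.seed(1)
a <- rnorm(20)
b <- 0.5 * a + rnorm(20)
cc <- ccf(a, b, lag.max = 0, plot = FALSE)$acf[1]
all.equal(cc, cor(a, b))   # TRUE
```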
3.1.1. Overall
group1 %>%
  select(yearmonth, topic_name, sentiment_n) %>%
  spread(topic_name, sentiment_n) %>%
  select(- yearmonth) -> corr1

polls_monthly %>% select(- yearmonth) -> corr2

p <- plot_ccf(corr1, corr2, alpha = 0.8,
              subtitle = "Election polls vs. topic sentiment value (monthly)")
## Warning: Ignoring unknown parameters: subtitle
ggsave(filename = "../figs/ccf.png", plot = p,
       height = 6, width = 8, dpi = 600)
Cross-correlation

3.1.2. By Media

3.2. Graphical analysis

For each party, we plot the topics that most strongly correlate with the poll data.

plot_Poll_Sent <- function(topic_number, party_str, color_str) {
  
  ### Plot 1  
  plot1 <- group1 %>%
    filter(topic %in% topic_number)
  
  p1 <- ggplot() +
    geom_col(data=plot1, aes(yearmonth, sentiment_n,
                             group = topic_name,
                             fill = topic_name),
             position = "dodge",
             alpha = 0.5) +
    scale_fill_manual(values = col) +
    
    geom_line(data=plot1, aes(yearmonth, 
                              sentiment_n,
                              group = topic_name,
                              color = topic_name),
              show.legend = F) +
    scale_color_manual(values = col) +
    guides(fill = guide_legend(ncol=3)) +
    
    # Add poll line
    geom_line(data = polls_monthly,
              aes_string("yearmonth", party_str),
              color = color_str,
                            size = 0.9,
              linetype = 2) +
    labs(y="", x="", fill="",
         title = "Sentiment & Poll Values",
         subtitle = party_str,
         caption = "Both scores are normalized") +
    scale_x_date(breaks = date_breaks("1 month"), 
                 labels=date_format("%B", tz="CET")) +
    theme(legend.position = "bottom") 
  
  ### Plot 2
  plot2 <- group2 %>%
    filter(topic %in% topic_number)
  
  p2 <- ggplot() +
    # geom_col(data = plot2,
    #          aes(yearmonth, sentiment_n, 
    #              group = topic_name,
    #              fill = topic_name),
    #          alpha = 0.5,
    #          position = "dodge") +
    # scale_fill_manual(values = col) +
    
    geom_line(data = plot2,
              aes(yearmonth, sentiment_n,
                  group = topic_name,
                  color = topic_name),
              show.legend = F) +
    facet_wrap(~site, ncol = 2) +
    scale_color_manual(values = col) +
    
    # Add poll line
    geom_line(data = polls_monthly, 
              aes_string("yearmonth", party_str),
              color = color_str,
              size = 0.9,
              linetype = 2) +
    labs(y="", x="", fill="") +
    scale_x_date(breaks = date_breaks("1 month"), 
                 labels=date_format("%B", tz="CET")) +
    theme(legend.position = "none",
        axis.text.x = element_text(angle = 90))
  
  p1 + p2 + plot_layout(ncol = 1, heights = c(1,4))
  
}
3.2.1. CDU
topic_number <- c(10,13,23,26)

### Plot 1  
plot_Poll_Sent(topic_number, "Union", "black")

3.2.2. SPD

topic_number <- c(1,17,22)

plot_Poll_Sent(topic_number, "SPD", "black")

3.2.3. AfD
topic_number <- c(17,22, 30)

plot_Poll_Sent(topic_number, "AfD", "black")

3.2.4. FDP

topic_number <- c(23,26,27)

plot_Poll_Sent(topic_number, "FDP", "black")

3.2.5. Grüne

topic_number <- c(26,27,17)

plot_Poll_Sent(topic_number, "Grüne", "black")

3.2.6. DIE LINKE

topic_number <- c(1,17,22)

plot_Poll_Sent(topic_number, "Linke", "black")